2  String, Dates, and Joins

“Especially in football, it is not a QB’s game… even though the media likes to make it into that - it takes the whole team.” - Joe Montana

“The Catch”: Dwight Clark catches a pass from Joe Montana resulting in a touchdown to give the 49ers a 28-27 victory over the Dallas Cowboys.

This section demonstrates how to manage and analyze sports data using string manipulation and date-time operations. We’ll explore tools from the tidyverse in R, using basketball-related datasets like pgfreethrow.csv and other datasets available in R packages.

2.1 String Manipulation

String manipulation is essential when handling categorical data or extracting information embedded within text. Using the stringr package from tidyverse, we can efficiently handle text data. This includes cleaning up inconsistent formatting, extracting specific patterns, and transforming data into analyzable formats. For example, when dealing with play results, you might need to distinguish between various outcomes or classify performance metrics based on embedded textual descriptions.

Efficient string manipulation also allows for seamless preprocessing of large datasets, ensuring that raw, unstructured text data can be transformed into usable forms. With tools like regular expressions and the functionality of stringr, you can extract numerical details from text, standardize categories, and even generate new variables based on patterns within your data. These transformations are vital for downstream analyses, including statistical modeling and visualization.

Moreover, combining stringr with other tidyverse functions facilitates a unified workflow. This integration ensures that data wrangling and string processing can occur in a pipeline, reducing redundancy and enhancing reproducibility in sports analytics projects.

2.1.1 Common stringr Functions

The stringr package provides a comprehensive set of tools for text manipulation. Below are some commonly used functions:

  • str_detect(): Checks for the presence of a pattern in a string and returns a logical value. This is useful for categorizing or flagging data based on text content.

  • str_extract(): Extracts the first occurrence of a specified pattern. Ideal for pulling out specific details like numbers or substrings embedded in text.

  • str_replace(): Replaces occurrences of a pattern with a new string. Useful for standardizing text formats or correcting inconsistencies.

  • str_split(): Splits strings into components based on a delimiter. This is helpful when parsing structured text like CSV fields embedded in a string.

  • str_length(): Returns the number of characters in a string. Useful for quality checks or trimming operations.

By leveraging these string manipulation methods, analysts can address challenges like inconsistent data entry or embedded patterns within text. These methods not only enhance the quality of the dataset but also open up possibilities for deeper and more nuanced analyses. Ultimately, they enable a smoother transition from raw data to actionable insights, which is especially critical in high-volume datasets.

Example 2.1 (Determine Free Throw Outcome) The pgfreethrow.csv dataset contains free throws shot by Paul George during his career through the 2021-2022 season. The data was obtained from https://www.basketball-reference.com. The data contains a column playresult, which logs whether a free throw is a Make or a Miss along with the shot number. We’ll extract and classify this information but first let’s examine the dataset.

library(tidyverse)

dat = read_csv("data/pgfreethrow.csv")

dat |> glimpse()
Rows: 4,170
Columns: 9
$ date       <chr> "10/27/2010", "10/27/2010", "11/9/2010", "11/9/2010", "11/1…
$ vis_team   <chr> "Indiana Pacers", "Indiana Pacers", "Denver Nuggets", "Denv…
$ home_team  <chr> "San Antonio Spurs", "San Antonio Spurs", "Indiana Pacers",…
$ quarter    <chr> "3", "3", "4", "4", "4", "1", "1", "3", "2", "2", "3", "3",…
$ time       <time> 04:26:00, 04:26:00, 06:50:00, 06:50:00, 00:26:00, 00:38:00…
$ court      <chr> "away", "away", "home", "home", "home", "home", "home", "ho…
$ playresult <chr> "makes free throw 1 of 2", "makes free throw 2 of 2", "miss…
$ away_score <dbl> 83, 84, 93, 93, 78, 20, 20, 47, 30, 30, 76, 77, 80, 81, 106…
$ home_score <dbl> 81, 81, 124, 125, 107, 24, 25, 67, 24, 25, 77, 77, 77, 77, …

We first want to determine if the free throw attempt was a Make or a Miss. We can use str_detect in combination with if_else.

# Determine the outcome of each free throw
dat = dat |>
  mutate(
    outcome = if_else(str_detect(playresult, "make"), "make", "miss")
  )

dat |> glimpse()
Rows: 4,170
Columns: 10
$ date       <chr> "10/27/2010", "10/27/2010", "11/9/2010", "11/9/2010", "11/1…
$ vis_team   <chr> "Indiana Pacers", "Indiana Pacers", "Denver Nuggets", "Denv…
$ home_team  <chr> "San Antonio Spurs", "San Antonio Spurs", "Indiana Pacers",…
$ quarter    <chr> "3", "3", "4", "4", "4", "1", "1", "3", "2", "2", "3", "3",…
$ time       <time> 04:26:00, 04:26:00, 06:50:00, 06:50:00, 00:26:00, 00:38:00…
$ court      <chr> "away", "away", "home", "home", "home", "home", "home", "ho…
$ playresult <chr> "makes free throw 1 of 2", "makes free throw 2 of 2", "miss…
$ away_score <dbl> 83, 84, 93, 93, 78, 20, 20, 47, 30, 30, 76, 77, 80, 81, 106…
$ home_score <dbl> 81, 81, 124, 125, 107, 24, 25, 67, 24, 25, 77, 77, 77, 77, …
$ outcome    <chr> "make", "make", "miss", "make", "make", "make", "make", "ma…

2.1.2 Regular Expressions

Regular expressions (regex) are a powerful tool for identifying patterns in text. They allow you to define complex search patterns using a concise syntax, making them ideal for string extraction, validation, and manipulation. In R, the stringr package seamlessly integrates regex functionality, enabling analysts to perform advanced text operations efficiently.

For example, a regex pattern like "\\d of \\d" can identify phrases like “1 of 2” in a dataset, extracting structured information embedded within text fields. Regex is particularly useful for parsing datasets with varying formats or extracting subsets of data based on specific conditions. Combined with stringr functions like str_detect, str_extract, and str_replace, regex becomes an essential part of your data preprocessing toolkit.

While regex might seem daunting at first, breaking patterns into smaller components and testing them incrementally can make them more approachable. Mastery of regex not only improves data cleaning workflows but also enhances the accuracy and depth of data extraction, empowering analysts to unlock insights that would otherwise remain hidden.

Common regex patterns

Here are some commonly used regex patterns and their functions:

  • ^: Matches the start of a string. Useful for identifying strings that begin with a specific pattern. Example: ^Make matches any string starting with “Make”.

  • $: Matches the end of a string. Useful for ensuring a string concludes with a particular sequence. Example: 2$ matches strings ending in “2”.

  • .: Matches any single character except a newline. This is helpful for flexible matching. Example: M.ke matches “Make”, “Mike”, or “Moke”.

  • \d: Matches any digit. Useful for extracting numeric values. Example: \d of \d matches strings like “1 of 2”.

  • []: Matches any character inside the brackets. Use this to define a character set. Example: [Mm]ake matches “Make” and “make”.

  • +: Matches one or more occurrences of the preceding element. Example: \d+ matches any sequence of digits, such as “123” or “7”.

By combining these patterns, you can create robust expressions for almost any text-processing need. Regular expressions are incredibly versatile and, when used effectively, can transform the way you handle text data.

More examples

Here are additional examples of regex applications using some simple toy datasets.

  1. Extract Player Names

    Suppose we have player names in the format “Last, First”. Using regex, we can separate them into first and last names.

    players = tibble(name = c("James, LeBron", "Curry, Stephen", "Durant, Kevin"))
    
    players = players |>
      mutate(
        first_name = str_extract(name, "(?<=, ).*"),
        last_name = str_extract(name, "^.*(?=,)")
      )
    
    players
    # A tibble: 3 × 3
      name           first_name last_name
      <chr>          <chr>      <chr>    
    1 James, LeBron  LeBron     James    
    2 Curry, Stephen Stephen    Curry    
    3 Durant, Kevin  Kevin      Durant   
  2. Identify Scoring Plays

    In game logs, scoring plays may include text like “scores 3 points” or “makes a free throw”. Use regex to flag all scoring events.

    plays = tibble(event = c("LeBron scores 3 points", "Curry makes a free throw", "Durant misses shot"))
    
    plays = plays |>
      mutate(scoring = str_detect(event, "scores|makes"))
    
    plays
    # A tibble: 3 × 2
      event                    scoring
      <chr>                    <lgl>  
    1 LeBron scores 3 points   TRUE   
    2 Curry makes a free throw TRUE   
    3 Durant misses shot       FALSE  
  3. Parse Complex IDs

    For datasets with IDs like “P123-456-789”, extract the numeric components for further analysis.

    ids = tibble(player_id = c("P123-456-789", "P987-654-321"))
    
    ids = ids |>
      mutate(
        part1 = str_extract(player_id, "(?<=P)\\d+"),
        part2 = str_extract(player_id, "(?<=-)\\d+(?=-)"),
        part3 = str_extract(player_id, "\\d+$")
      )
    
    ids
    # A tibble: 2 × 4
      player_id    part1 part2 part3
      <chr>        <chr> <chr> <chr>
    1 P123-456-789 123   456   789  
    2 P987-654-321 987   654   321  

2.2 Working with Dates

Analyzing performance trends often involves working with date-time variables. Dates might be stored as text, requiring conversion to a proper date format.

2.2.1 Common lubridate Functions

The lubridate package is an essential tool in R for working with date and time data. Below are some of the most commonly used functions and their applications:

  • ymd() / mdy() / dmy(): These functions parse text strings into date objects, assuming the formats “Year-Month-Day”, “Month-Day-Year”, or “Day-Month-Year”, respectively.

    Example:

    ymd("2023-12-29")  # Returns a date object: "2023-12-29"
    [1] "2023-12-29"
    mdy("12-29-2023")  # Returns a date object: "2023-12-29"
    [1] "2023-12-29"
  • year(), month(), day(): Extract specific components (year, month, or day) from a date object. The month() function can also return month names if label = TRUE is used.

    Example:

    date = ymd("2023-12-29")
    year(date)   
    [1] 2023
    month(date)  
    [1] 12
    day(date)    
    [1] 29
  • hour(), minute(), second(): Extract time components from a datetime object.

    Example:

    datetime = ymd_hms("2023-12-29 15:45:30")
    hour(datetime)  
    [1] 15
    minute(datetime) 
    [1] 45
    second(datetime)
    [1] 30
  • today() and now(): Retrieve the current date or datetime in the system’s timezone.

    Example:

    [1] "2025-02-24"
    now()  
    [1] "2025-02-24 18:44:15 CST"
  • floor_date(), ceiling_date(), round_date(): Round dates to the nearest unit such as day, month, or year.

    Example:

    date = ymd("2023-12-29")
    floor_date(date, "month")  
    [1] "2023-12-01"
    ceiling_date(date, "month")  
    [1] "2024-01-01"
  • interval() and duration(): Create and manipulate intervals or durations. These functions are useful for calculating time differences.

    Example:

    start = ymd("2023-01-01")
    end = ymd("2023-12-31")
    int = interval(start, end)
    int_shift(int, duration(days = 14))
    [1] 2023-01-15 UTC--2024-01-14 UTC

By understanding and using these functions, you can seamlessly handle date-time data in sports analytics, enabling you to analyze performance trends, calculate durations, and identify patterns over time.

Example 2.2 (Converting to dates)  

library(lubridate)

# Convert game_date to Date type
dat = dat |>
  mutate(date = mdy(date))

# Display the structure of the dataset
dat |> glimpse()
Rows: 4,170
Columns: 10
$ date       <date> 2010-10-27, 2010-10-27, 2010-11-09, 2010-11-09, 2010-11-18…
$ vis_team   <chr> "Indiana Pacers", "Indiana Pacers", "Denver Nuggets", "Denv…
$ home_team  <chr> "San Antonio Spurs", "San Antonio Spurs", "Indiana Pacers",…
$ quarter    <chr> "3", "3", "4", "4", "4", "1", "1", "3", "2", "2", "3", "3",…
$ time       <time> 04:26:00, 04:26:00, 06:50:00, 06:50:00, 00:26:00, 00:38:00…
$ court      <chr> "away", "away", "home", "home", "home", "home", "home", "ho…
$ playresult <chr> "makes free throw 1 of 2", "makes free throw 2 of 2", "miss…
$ away_score <dbl> 83, 84, 93, 93, 78, 20, 20, 47, 30, 30, 76, 77, 80, 81, 106…
$ home_score <dbl> 81, 81, 124, 125, 107, 24, 25, 67, 24, 25, 77, 77, 77, 77, …
$ outcome    <chr> "make", "make", "miss", "make", "make", "make", "make", "ma…

Suppose we want to determine if Paul George has different free throw percentages for different months of the year. Let’s extract the month component of the date so that we can summarize.

Example 2.3 (Extract Year and Month)  

# Extract year and month
dat = dat |>
  mutate(
    year = year(date),
    month = month(date, label = TRUE)
  )

# Summarize free throw percentages by month
dat_summary = dat |>
  group_by(month) |>
  summarize(
    total_shots = n(),
    makes = sum(outcome == "make"),
    free_throw_percentage = makes / total_shots * 100
  )

# Display summary
dat_summary
# A tibble: 12 × 4
   month total_shots makes free_throw_percentage
   <ord>       <int> <int>                 <dbl>
 1 Jan           642   545                  84.9
 2 Feb           513   434                  84.6
 3 Mar           695   583                  83.9
 4 Apr           598   507                  84.8
 5 May           257   197                  76.7
 6 Jun           143   117                  81.8
 7 Jul             4     2                  50  
 8 Aug            48    44                  91.7
 9 Sep            33    29                  87.9
10 Oct           127   105                  82.7
11 Nov           479   411                  85.8
12 Dec           631   544                  86.2

By exploring free throw performance by month, we can identify seasonal trends or patterns in player accuracy. Understanding such patterns can be invaluable for game preparation and performance analysis.

2.3 Combining Data Frames

In sports analytics, combining datasets is often necessary to enrich analyses by merging relevant data sources. The dplyr package in R provides several join functions to combine data frames based on common columns (keys). These include:

2.3.1 Types of Joins

  • inner_join(): Returns rows with matching keys in both data frames.
  • left_join(): Returns all rows from the first data frame and matching rows from the second data frame.
  • right_join(): Returns all rows from the second data frame and matching rows from the first data frame.
  • full_join(): Returns all rows from both data frames, with NA for unmatched rows.
  • semi_join(): Filters rows from the first data frame that have matches in the second data frame but doesn’t include columns from the second data frame.
  • anti_join(): Filters rows from the first data frame that do not have matches in the second data frame.

2.3.2 Examples: Combining Basketball Data

Let’s illustrate these joins using two basketball-related datasets:

  1. players contains player names and their team IDs.
  2. teams contains team IDs and their names.

The Data

players <- tibble(
  player_id = 1:4,
  player_name = c("LeBron James", "Stephen Curry", "Kevin Durant", "Chris Paul"),
  team_id = c(1, 2, 3, 4)
)

teams <- tibble(
  team_id = c(1, 2, 3, 5),
  team_name = c("Lakers", "Warriors", "Nets", "Knicks")
)

Inner Join

Keeps only players with matching team IDs in the teams table.

inner_join(players, teams, by = "team_id")
# A tibble: 3 × 4
  player_id player_name   team_id team_name
      <int> <chr>           <dbl> <chr>    
1         1 LeBron James        1 Lakers   
2         2 Stephen Curry       2 Warriors 
3         3 Kevin Durant        3 Nets     

Left Join

Keeps all players, adding team names where possible.

left_join(players, teams, by = "team_id")
# A tibble: 4 × 4
  player_id player_name   team_id team_name
      <int> <chr>           <dbl> <chr>    
1         1 LeBron James        1 Lakers   
2         2 Stephen Curry       2 Warriors 
3         3 Kevin Durant        3 Nets     
4         4 Chris Paul          4 <NA>     

Full Join

Combines all rows from both datasets.

full_join(players, teams, by = "team_id")
# A tibble: 5 × 4
  player_id player_name   team_id team_name
      <int> <chr>           <dbl> <chr>    
1         1 LeBron James        1 Lakers   
2         2 Stephen Curry       2 Warriors 
3         3 Kevin Durant        3 Nets     
4         4 Chris Paul          4 <NA>     
5        NA <NA>                5 Knicks   

Semi Join

Filters players to include only those on teams listed in teams.

semi_join(players, teams, by = "team_id")
# A tibble: 3 × 3
  player_id player_name   team_id
      <int> <chr>           <dbl>
1         1 LeBron James        1
2         2 Stephen Curry       2
3         3 Kevin Durant        3

Anti Join

Filters players to include only those not matched in teams.

anti_join(players, teams, by = "team_id")
# A tibble: 1 × 3
  player_id player_name team_id
      <int> <chr>         <dbl>
1         4 Chris Paul        4

Understanding and using these joins effectively is critical for analyzing sports data where relationships between entities, such as players and teams, are vital.

2.4 Concept Quiz

  1. What function in stringr is used to detect the presence of a pattern in a string?
  2. What is the purpose of the str_replace() function?
  3. What regex pattern matches one or more digits?
  4. Which lubridate function parses dates in the format “Year-Month-Day”?
  5. What does the interval() function in lubridate do?
  6. Which join in dplyr returns rows that have matches in both data frames?
  7. Which join ensures all rows from both data frames are included?
  8. If players has 6 rows and teams has 3 matching rows, how many rows will the result of inner_join(players, teams, by = "team_id") contain?